Abstract

This paper explores unsupervised learning of parsing models along two directions. First,
which models are identifiable from infinite data? We use a general technique for numerically
checking identifiability based on the rank of a Jacobian matrix, and apply it to several
standard constituency and dependency parsing models. Second, for identifiable models, how do
we estimate the parameters efficiently? EM suffers from local optima, while recent work using
spectral methods [1] cannot be directly applied since the topology of the parse tree varies across
sentences. We develop a strategy, unmixing, which deals with this additional complexity for
restricted classes of parsing models.



1 Introduction 

Generative parsing models, which define joint distributions over sentences and their parse trees,
are one of the core techniques in computational linguistics. We are interested in the unsupervised
learning of these models [2]-[6], where the goal is to estimate the model parameters given only
examples of sentences. Unsupervised learning can fail for a number of reasons [7]: model
misspecification, non-identifiability, estimation error, and computation error. In this paper, we
delve into two of these issues: identifiability and computation. In doing so, we confront a central
challenge of parsing models: the topology of the parse tree is unobserved and varies across
sentences. This is in contrast to standard phylogenetic models [8] and other latent tree models for
which there is a single fixed global tree across all examples [9].

A model is identifiable if there is enough information in the data to pinpoint the parameters (up
to some trivial equivalence class); establishing the identifiability of a model is often a highly
non-trivial task. A classic result of Kruskal [10] has been employed to prove the identifiability of
a wide class of latent variable models, including hidden Markov models and certain restricted
mixtures of latent tree models [11-13]. However, these techniques cannot be directly applied to
parsing models since the tree topology varies over an exponential set of possible topologies.
Instead, we turn to techniques from algebraic geometry [14-17]; we show that a simple numerical
procedure can be used to check identifiability for a wide class of models in NLP. Using this tool,
we discover that probabilistic context-free grammars (PCFGs) are non-identifiable, but that simpler
PCFG variants and dependency models are identifiable.

The most common way to estimate unsupervised parsing models is by using local techniques
such as EM [18] or MCMC sampling [19], but these methods can suffer from local optima and slow
mixing. Meanwhile, recent work [1, 20-23] has shown that spectral methods can be used to estimate
mixture models and HMMs with provable guarantees. These techniques express low-order moments
of the observable distribution as a product of matrix parameters and use eigenvalue decomposition
to recover these matrices. However, these methods are not directly applicable to parsing models
because the tree topology again varies non-trivially. To address this, we propose a new technique,
unmixing. The main idea is to express moments of the observable distribution as a mixture over
the possible topologies. For restricted parsing models, the moments for a fixed tree structure can
be "unmixed", thereby reducing the problem to one with a fixed topology, which can be tackled
using standard techniques [1]. Importantly, our unmixing technique does not require the training
sentences to be annotated with the tree topologies a priori, in contrast to recent extensions of [21]
to learning PCFGs [24] and dependency trees [25, 26], which work on a fixed topology.



2 Notation 

For a positive integer $n$, define $[n] := \{1, \dots, n\}$ and $\langle n \rangle := \{e_1, \dots, e_n\}$,
where $e_i$ is the vector which is 1 in component $i$ and 0 elsewhere. For integers $a, b \in [n]$, let
$a \otimes_n b = (a - 1)n + b \in [n^2]$ be the integer encoding of the pair $(a, b)$. For a pair of
matrices $A, B \in \mathbb{R}^{m \times n}$, define the columnwise tensor product
$A \otimes^{C} B \in \mathbb{R}^{m^2 \times n}$ to be such that
$(A \otimes^{C} B)_{(i_1 \otimes_m i_2) j} = A_{i_1 j} B_{i_2 j}$. For a matrix
$A \in \mathbb{R}^{m \times n}$, let $A^\dagger$ denote the Moore-Penrose pseudoinverse.
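
As an illustration of the pair encoding and the columnwise tensor product defined above, here is a minimal NumPy sketch. The function names `pair_index` and `columnwise_tensor` are our own choices for this example and do not come from the paper.

```python
import numpy as np

def pair_index(a: int, b: int, n: int) -> int:
    """1-based integer encoding of the pair (a, b), with a, b in {1, ..., n}."""
    return (a - 1) * n + b

def columnwise_tensor(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Columnwise tensor product: column j equals kron(A[:, j], B[:, j])."""
    m, n = A.shape
    assert B.shape == (m, n)
    # out[i1, i2, j] = A[i1, j] * B[i2, j]; flattening (i1, i2) row-major
    # matches the encoding (i1 - 1) * m + i2 (shifted to 0-based indices).
    return np.einsum("ij,kj->ikj", A, B).reshape(m * m, n)

A = np.arange(1.0, 7.0).reshape(3, 2)   # [[1, 2], [3, 4], [5, 6]]
B = np.arange(7.0, 13.0).reshape(3, 2)  # [[7, 8], [9, 10], [11, 12]]
C = columnwise_tensor(A, B)             # shape (9, 2)

# Row for the pair (i1, i2) = (2, 3), first column: A[2,1] * B[3,1] in 1-based terms.
entry = C[pair_index(2, 3, 3) - 1, 0]
```

Each column of the result is the Kronecker product of the corresponding columns of the inputs, which is how the factorized production matrices of the restricted PCFG variants in Section 3.1 are built.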



3 Parsing models 

A sentence is a sequence of $L$ words, $x = (x_1, \dots, x_L)$, where each word $x_i \in \langle d \rangle$
is one of $d$ possible word types. A (generative) parsing model defines a joint distribution
$\mathbb{P}_\theta(x, z)$ over a sentence $x$ and its parse tree $z$ (to be made precise later), where
$\theta$ are the model parameters (a collection of multinomials). Each parse tree $z$ has a topology
$\mathrm{Topology}(z) \in \mathrm{Topologies}$, which is both unobserved and varying across sentences.
The learning problem is to recover $\theta$ given only samples of $x$.

Two important classes of models of natural language syntax are constituency models, which
represent a hierarchical grouping and labeling of the phrases of a sentence (e.g., Figure 1(a)), and
dependency models, which represent pairwise relationships between the words of a sentence (e.g.,
Figure 1(b)).

3.1 Constituency models 

A constituency tree $z = (V, s)$ consists of a set of nodes $V$ and a collection of hidden states
$s = \{s_v\}_{v \in V}$. Each state $s_v \in \langle k \rangle$ represents one of $k$ possible syntactic
categories. Each node $v$ has the form $[i : j]$ for $0 \le i < j \le L$, corresponding to the phrase
between positions $i$ and $j$ of the sentence. These nodes form a binary tree as follows: the root
node is $[0 : L] \in V$, and for each node $[i : j] \in V$ with $j - i > 1$, there exists a unique $m$ with
$i < m < j$ defining the two children nodes $[i : m] \in V$ and $[m : j] \in V$. Let $\mathrm{Topology}(z)$
be an integer encoding of $V$.

PCFG. Perhaps the most well-known constituency parsing model is the probabilistic context-free
grammar (PCFG). The parameters of a PCFG are $\theta = (\pi, B, O)$, where $\pi \in \mathbb{R}^k$ specifies
the initial state distribution, $B \in \mathbb{R}^{k^2 \times k}$ specifies the binary production
distributions, and $O \in \mathbb{R}^{d \times k}$ specifies the emission distributions.

[Figure 1, panels: (a) Constituency (PCFG-IE); (b) Dependency (DEP-IE).]

Figure 1: The two constituency trees and seven dependency trees over $L = 3$ words, $x_1, x_2, x_3$.
(a) A constituency tree consists of a hierarchical grouping of the words with a latent state $s_v$ for
each node $v$. (b) A dependency tree consists of a collection of directed edges between the words.
In both cases, we have labeled each edge from $i$ to $j$ with the parameters used to generate the state
of node $j$ given $i$.

A PCFG corresponds to the following generative process (see Figure 1(a) for an example):
choose a topology $\mathrm{Topology}(z)$ uniformly at random;¹ generate the state of the root node
using $\pi$; recursively generate pairs of children states given their parents using $B$; and finally
generate words $x_i$ given their parents using $O$. This generative process defines a joint probability
over a sentence $x$ and a parse tree $z$:

\[
\mathbb{P}_\theta(x, z) = |\mathrm{Topologies}|^{-1} \, \pi^\top s_{[0:L]}
  \prod_{[i:m],[m:j] \in V} (s_{[i:m]} \otimes_k s_{[m:j]})^\top B \, s_{[i:j]}
  \prod_{i=1}^{L} x_i^\top O \, s_{[i-1:i]}. \tag{1}
\]

We will also consider two variants of the PCFG with additional restrictions: 

PCFG-I. The left and right children states are generated independently; that is, we have the
following factorization: $B = T_1 \otimes^{C} T_2$ for some $T_1, T_2 \in \mathbb{R}^{k \times k}$.

PCFG-IE. The left and the right productions are independent and equal: $B = T \otimes^{C} T$.

3.2 Dependency tree models

In contrast to constituency trees, which posit internal nodes with latent states, dependency trees
connect the words directly. A dependency tree $z$ is a set of directed edges $(i, j)$, where $i, j \in [L]$
are distinct positions in the sentence. Let $\mathrm{Root}(z)$ denote the position of the root node of $z$.
We consider only projective dependency trees [27]: $z$ is projective if for every path from $i$ to $j$ to
$k$ in $z$, we have that $j$ and $k$ are on the same side of $i$ (that is, $j - i$ and $k - i$ have the same
sign). Let $\mathrm{Topology}(z)$ be an integer encoding of $z$.

¹ Usually a PCFG induces a topology via a state-dependent probability of choosing a binary production
versus an emission. Our model is a restriction which corresponds to a state-independent probability.

DEP-I. We consider a simple dependency model. The parameters of this model are
$\theta = (\pi, A_{\swarrow}, A_{\searrow})$, where $\pi \in \mathbb{R}^d$ is the initial word distribution and
$A_{\swarrow}, A_{\searrow} \in \mathbb{R}^{d \times d}$ are the left and right argument distributions. The
generative process is as follows: choose a topology $\mathrm{Topology}(z)$ uniformly at random, generate
the root word using $\pi$, and recursively generate argument words to the left and to the right given
the parent word using $A_{\swarrow}$ and $A_{\searrow}$, respectively. The corresponding joint probability
distribution is as follows:

\[
\mathbb{P}_\theta(x, z) = |\mathrm{Topologies}|^{-1} \, \pi^\top x_{\mathrm{Root}(z)}
  \prod_{(i,j) \in z} x_j^\top A_{\mathrm{dir}(i,j)} \, x_i, \tag{2}
\]

where $\mathrm{dir}(i, j) = \swarrow$ if $j < i$ and $\searrow$ if $j > i$.

We also consider the following two variants:

DEP-IE. The left and right argument distributions are equal: $A = A_{\swarrow} = A_{\searrow}$.

DEP-IES. $A = A_{\swarrow} = A_{\searrow}$ and $\pi$ is the stationary distribution of $A$ (that is, $\pi = A\pi$).

4 Identifiability

Our goal is to estimate model parameters $\theta_0 \in \Theta$ given only access to sentences
$x \sim \mathbb{P}_{\theta_0}$. Specifically, suppose we have an observation function
$\phi(x) \in \mathbb{R}^m$, which is the only lens through which an algorithm can view the data. We ask a
basic question: in the limit of infinite data, is it information-theoretically possible to identify
$\theta_0$ from the observed moments $\mu(\theta_0) := \mathbb{E}_{\theta_0}[\phi(x)]$?

To be more precise, define the equivalence class of $\theta_0$ to be the set of parameters $\theta$ that
yield the same observed moments:

\[
S_\Theta(\theta_0) = \{\theta \in \Theta : \mu(\theta) = \mu(\theta_0)\}. \tag{3}
\]

It is impossible for an algorithm to distinguish among the elements of $S_\Theta(\theta_0)$. Therefore,
one might want to ensure that $|S_\Theta(\theta_0)| = 1$ for all $\theta_0 \in \Theta$. However, this
requirement is too strong for two reasons. First, models often have natural symmetries: e.g., the $k$
states of any PCFG can be permuted without changing $\mu(\theta)$, so $|S_\Theta(\theta_0)| \ge k!$. Second,
$|S_\Theta(\theta_0)| = \infty$ for some pathological $\theta_0$'s: e.g., PCFGs where all states have the
same emission distribution $O$ are indistinguishable regardless of the production distributions $B$.
The following definition of identifiability accommodates these two exceptional cases:

Definition 1 (Identifiability). A model family with parameter space $\Theta$ is (globally) identifiable
from $\phi$ if there exists a measure zero set $\mathcal{E}$ such that $|S_\Theta(\theta_0)|$ is finite for
every $\theta_0 \in \Theta \setminus \mathcal{E}$. It is locally identifiable from $\phi$ if there exists a
measure zero set $\mathcal{E}$ such that, for every $\theta_0 \in \Theta \setminus \mathcal{E}$, there exists
an open neighborhood $N(\theta_0)$ around $\theta_0$ such that
$S_\Theta(\theta_0) \cap N(\theta_0) = \{\theta_0\}$.



Example of non-identifiability. Consider the DEP-IE model with $L = 2$ with the full observation
function $\phi(x) = x_1 \otimes x_2$. The corresponding observed moments are
$\mu(\theta) = 0.5 A \,\mathrm{diag}(\pi) + 0.5 \,\mathrm{diag}(\pi) A^\top$. Note that $A \,\mathrm{diag}(\pi)$
is an arbitrary $d \times d$ matrix whose entries sum to 1, which has $d^2 - 1$ degrees of freedom,
whereas $\mu(\theta)$ is a symmetric matrix whose entries sum to 1, which has $\binom{d+1}{2} - 1$
degrees of freedom. Therefore, $S_\Theta(\theta)$ has dimension $\binom{d}{2}$ and therefore the model is
non-identifiable.
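
The example above can be checked numerically. The following sketch (our own construction, not code from the paper) builds two different DEP-IE parameter settings for $L = 2$ that produce exactly the same observed moments, by perturbing the joint matrix $A\,\mathrm{diag}(\pi)$ along an antisymmetric direction. The helper names and the perturbation size 0.01 are assumptions made for the illustration.

```python
import numpy as np

def dep_ie_moment(pi, A):
    """Observed moments for DEP-IE with L = 2: 0.5 A diag(pi) + 0.5 diag(pi) A^T."""
    joint = A @ np.diag(pi)
    return 0.5 * joint + 0.5 * joint.T

def params_from_joint(joint):
    """Read (pi, A) off joint = A diag(pi), using that the columns of A sum to 1."""
    pi = joint.sum(axis=0)
    return pi, joint / pi          # divides column j by pi[j]

rng = np.random.default_rng(0)
d = 3
joint1 = rng.uniform(0.5, 1.5, size=(d, d))
joint1 /= joint1.sum()             # positive entries summing to 1

# Antisymmetric direction: entries sum to 0 and vanish after symmetrization.
K = np.triu(np.ones((d, d)), 1)
K = K - K.T
joint2 = joint1 + 0.01 * K         # small enough to keep all entries positive

pi1, A1 = params_from_joint(joint1)
pi2, A2 = params_from_joint(joint2)
mu1, mu2 = dep_ie_moment(pi1, A1), dep_ie_moment(pi2, A2)

# Dimension of the equivalence class predicted by the counting argument.
class_dim = d * (d - 1) // 2
```

Here `mu1` and `mu2` coincide although `(pi1, A1)` and `(pi2, A2)` differ; the antisymmetric matrices span exactly the $\binom{d}{2}$ directions counted in the text.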



Parameter counting. It is important to compute the degrees of freedom correctly; simple parameter
counting is insufficient. For example, consider the PCFG-IE model with $L = 2$. The
observed moments with respect to $\phi(x) = x_1 \otimes x_2$ form a $d \times d$ matrix, which places $d^2$
constraints on the $k^2 + (d - 1)k$ parameters. When $d > 2k$, there are more constraints than
parameters, but the PCFG-IE model with $L = 2$ is actually non-identifiable (as we will see later).
The issue here is that the number of constraints does not reveal the fact that some of these
constraints are redundant.



4.1 Observation functions 

An observation function $\phi(x)$ and its associated observed moments
$\mu(\theta_0) = \mathbb{E}_{\theta_0}[\phi(x)]$ reveal aspects of the distribution
$\mathbb{P}_{\theta_0}(x)$. For example, $\phi(x) = x_1$ would only reveal the marginal distribution
of the first word, whereas $\phi(x) = x_1 \otimes \cdots \otimes x_L$ reveals the entire distribution of
$x$. There is a tradeoff: higher-order moments provide more information, but are harder to estimate
reliably given finite data, and are also computationally more expensive. In this paper, we consider
the following intermediate moments:

\[
\begin{array}{ll}
\phi_{12}(x) := x_1 \otimes x_2 &
\phi_{**}(x) := (x_i \otimes x_j : i, j \in [L]) \\
\phi_{123}(x) := x_1 \otimes x_2 \otimes x_3 &
\phi_{***}(x) := (x_i \otimes x_j \otimes x_k : i, j, k \in [L]) \\
\phi_{123\eta}(x) := (x_1 \otimes x_2)(\eta^\top x_3) &
\phi_{***\eta}(x) := ((x_i \otimes x_j)(\eta^\top x_k) : i, j, k \in [L]) \\
\phi_{\mathrm{all}}(x) := x_1 \otimes \cdots \otimes x_L &
\end{array}
\]

Above, $\eta \in \mathbb{R}^d$ denotes a unit vector in $\mathbb{R}^d$ (e.g., $e_1$) which picks out a linear
combination of matrix slices from a third-order $d \times d \times d$ tensor.
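
To make the observation functions concrete, here is a small sketch that evaluates $\phi_{12}$, $\phi_{123}$ and $\phi_{123\eta}$ on a single sentence of one-hot word vectors. The sentence, the vocabulary size and the vector `eta` are arbitrary values chosen for illustration, and the function names are ours.

```python
import numpy as np

def one_hot(word: int, d: int) -> np.ndarray:
    """Indicator vector for a 0-based word index."""
    e = np.zeros(d)
    e[word] = 1.0
    return e

def phi_12(x):
    """Second-order observation: x_1 (outer) x_2, a d x d matrix."""
    return np.outer(x[0], x[1])

def phi_123(x):
    """Third-order observation: x_1 (outer) x_2 (outer) x_3, a d x d x d tensor."""
    return np.einsum("i,j,k->ijk", x[0], x[1], x[2])

def phi_123_eta(x, eta):
    """Sliced third-order observation: (x_1 (outer) x_2) * (eta^T x_3)."""
    return np.outer(x[0], x[1]) * (eta @ x[2])

d = 4
sentence = [one_hot(w, d) for w in (2, 0, 3)]
eta = np.array([0.5, 0.0, 0.0, 2.0])

sliced = phi_123_eta(sentence, eta)
# The sliced observation is the third-order tensor contracted with eta.
contracted = np.einsum("ijk,k->ij", phi_123(sentence), eta)
```

Averaging these quantities over many sampled sentences would give empirical estimates of the corresponding observed moments.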



4.2 Automatically checking identifiability 

One immediate goal is to determine which models in Section 3 are identifiable from which of the
observed moments (Section 4.1). A powerful analytic tool that has been successfully applied in
previous work is Kruskal's theorem [10, 11], but (i) it does not immediately apply to models with
random topologies, and (ii) it only gives sufficient conditions for identifiability, and cannot be
used to determine non-identifiability. Furthermore, since it is common practice to explore many
different models for a given problem in rapid succession, we would like to check identifiability
quickly and reliably. In this section, we develop an automatic procedure to do this.

To establish identifiability, let us examine the algebraic structure of $S_\Theta(\theta_0)$ for
$\theta_0 \in \Theta$, where we assume that the parameter space $\Theta$ is an open subset of
$[0, 1]^n$.² Recall that $S_\Theta(\theta_0)$ is defined by the moment constraints
$\mu(\theta) = \mu(\theta_0)$. We can write these constraints as $h_{\theta_0}(\theta) = 0$, where

\[
h_{\theta_0}(\theta) := \mu(\theta) - \mu(\theta_0)
\]

is a vector of $m$ polynomials in $\theta$.

² While we initially defined $\theta$ to be a tuple of conditional probability matrices, we will now use
its non-redundant vectorized form $\theta \in \mathbb{R}^n$.

Let us now compute the number of degrees of freedom of $h_{\theta_0}$ around $\theta_0$. The key quantity
is $J(\theta) \in \mathbb{R}^{m \times n}$, the Jacobian of $h_{\theta_0}$ at $\theta$ (note that the Jacobian
of $h_{\theta_0}$ does not depend on $\theta_0$; it is precisely the Jacobian of $\mu$). This Jacobian
criterion is well-established in algebraic geometry, and has been adopted in the statistical
literature for testing model identifiability and other related properties [14-17].

Intuitively, each row of $J(\theta_0)$ corresponds to a direction of a constraint violation, and thus
the row space of $J(\theta_0)$ corresponds to all directions that would take us outside the equivalence
class $S_\Theta(\theta_0)$. If $J(\theta_0)$ has rank less than $n$, then there is a direction orthogonal
to all the rows along which we can move and still satisfy all the constraints; in other words,
$|S_\Theta(\theta_0)|$ is infinite, and therefore the model is non-identifiable. This intuition leads to
the following algorithm:



CheckIdentifiability:

1. Choose a point $\theta \in \Theta$ uniformly at random.

2. Compute the Jacobian matrix $J(\theta)$.

3. Return "yes" if the rank of $J(\theta)$ equals $n$ and "no" otherwise.
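
The three steps above can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's implementation: the Jacobian is approximated by central finite differences rather than computed by dynamic programming, the random point is drawn from a bounded region rather than uniformly over the whole parameter space, and the rank tolerance `tol` is an arbitrary choice. The moment map `mu_dep_ie` is the DEP-IE model with $L = 2$ from the example in Section 4; `mu_fixed_direction` is a hypothetical comparison map that observes $A\,\mathrm{diag}(\pi)$ directly.

```python
import numpy as np

def numerical_jacobian(mu, theta, h=1e-5):
    """Central-difference Jacobian of the moment map mu at theta (shape m x n)."""
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        cols.append((mu(theta + step) - mu(theta - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

def check_identifiability(mu, theta, tol=1e-6):
    """Return (rank == n, rank) for the Jacobian of mu at the point theta."""
    J = numerical_jacobian(mu, theta)
    rank = int(np.linalg.matrix_rank(J, tol=tol))
    return rank == theta.size, rank

def sample_theta(rng, d):
    """Random non-redundant parameters: d - 1 free entries per distribution."""
    full = rng.uniform(0.5, 1.5, size=(d, d + 1))
    full /= full.sum(axis=0)           # each column is a distribution
    return full[:-1].ravel()           # the last row is implied, so drop it

def unpack(theta, d):
    """Rebuild (pi, A) from the non-redundant vector theta."""
    free = theta.reshape(d - 1, d + 1)
    full = np.vstack([free, 1.0 - free.sum(axis=0)])
    return full[:, 0], full[:, 1:]

def mu_dep_ie(theta, d):
    """DEP-IE, L = 2: 0.5 A diag(pi) + 0.5 diag(pi) A^T, vectorized."""
    pi, A = unpack(theta, d)
    joint = A @ np.diag(pi)
    return (0.5 * joint + 0.5 * joint.T).ravel()

def mu_fixed_direction(theta, d):
    """Hypothetical comparison: the joint A diag(pi) is observed directly."""
    pi, A = unpack(theta, d)
    return (A @ np.diag(pi)).ravel()

d = 3
theta = sample_theta(np.random.default_rng(0), d)
result_dep_ie = check_identifiability(lambda t: mu_dep_ie(t, d), theta)
result_fixed = check_identifiability(lambda t: mu_fixed_direction(t, d), theta)
print(result_dep_ie, result_fixed)
```

With $d = 3$ there are $n = d^2 - 1 = 8$ parameters. By the counting argument in Section 4, the DEP-IE Jacobian should have rank $\binom{d+1}{2} - 1 = 5 < n$, so the check answers "no", while the comparison map should reach full rank.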

The following theorem asserts the correctness of CheckIdentifiability. It is largely based 
on techniques in [16], although we have not seen it explicitly stated in this form. 



Theorem 1 (Correctness of CheckIdentifiability). Assume the parameter space $\Theta$ is a
non-empty open connected subset of $[0, 1]^n$, and the observed moments
$\mu: \mathbb{R}^n \to \mathbb{R}^m$, with respect to observation function $\phi$, is a polynomial map.
Then with probability 1, CheckIdentifiability returns "yes" iff the model family is locally
identifiable from $\phi$. Moreover, if it returns "yes", then there exists
$\mathcal{E} \subset \Theta$ of measure zero such that the model family with parameter space
$\Theta \setminus \mathcal{E}$ is identifiable from $\phi$.

The proof of Theorem 1 is given in Appendix A.

4.3 Implementation of CheckIdentifiability

Computing the Jacobian. The rows of $J$ correspond to
$\partial \mathbb{E}_\theta[\phi_j(x)] / \partial \theta$ and can be computed efficiently by adapting
dynamic programs used in the E-step of an EM algorithm for parsing models. There are two main
differences: (i) we must sum over possible values of $x$ in addition to $z$, and (ii) we are not
computing moments, but rather gradients thereof. Specifically, we adapt the CKY algorithm for
constituency models and the algorithm of [27] for dependency models. See Appendix C.1 for
more details.

Numerical issues. Because we implemented CheckIdentifiability on a finite-precision machine,
the results are subject to numerical precision errors. However, we verified that our numerical
results are consistent with various analytically derived identifiability results (e.g., from [11]).

Observation functions (columns, left to right): $\phi_{12}$, $\phi_{**}$, $\phi_{123e_1}$, $\phi_{123}$, $\phi_{***e_1}$, $\phi_{***}$.
Entries in each row are listed left to right; an entry may span several columns.

Model               | Result
PCFG                | No, even from $\phi_{\mathrm{all}}$ for $L \in \{3, 4, 5\}$
PCFG-I / PCFG-IE    | No | Yes iff $L \ge 4$ | Yes iff $L \ge 3$
DEP-I               | No | Yes iff $L \ge 3$
DEP-IE / DEP-IES    | Yes iff $L \ge 3$

Figure 2: Local identifiability of parsing models. These findings are given by
CheckIdentifiability and have the semantics from Theorem 1. These were checked for
$d \in \{2, 3, \dots, 8\}$, $k \in \{2, \dots, d\}$ (applies only for PCFG models), $L \in \{2, 3, \dots, 9\}$.

4.4 Identifiability of constituency and dependency tree models 

We checked the identifiability status of various constituency and dependency tree models using
our implementation of CheckIdentifiability. We focus on the regime where $d \ge k$ for PCFGs;
additional results for $d < k$ are given in Appendix B.

The results are reported in Figure 2. First, we found that the PCFG is not identifiable from
$\phi_{\mathrm{all}}$ (and therefore not identifiable from any $\phi$) for $L \in \{3, 4, 5\}$; we believe that
the same holds for all $L$. This negative result motivates exploring restricted subclasses of PCFGs,
such as PCFG-I and PCFG-IE, which factorize the binary productions.³ For these classes, we found
that the sentence length $L$ and choice of observation function can influence identifiability: both
models are identifiable for large enough $L$ (e.g., $L \ge 3$) and with a sufficiently rich observation
function (e.g., $\phi_{123\eta}$).

The dependency models, DEP-I and DEP-IE, were all found to be identifiable for $L \ge 3$
from second-order moments ($\phi_{**}$). The conditions for identifiability are less stringent than
their constituency counterparts (PCFG-I and PCFG-IE), which is natural since dependency models are
simpler without the latent states. Note that in all identifiable models, second-order moments suffice
to determine the distribution; this is good news because low-order moments are easier to estimate.

5 Unmixing algorithms 

Having established which parsing models are identifiable, we now turn to parameter estimation for
these models. We will consider algorithms based on moment matching: those that try to find a
$\theta$ satisfying $\mu(\theta) = u$ for some $u$. Typically, $u$ is an empirical estimate of
$\mu(\theta_0) = \mathbb{E}_{\theta_0}[\phi(x)]$ based on samples $x \sim \mathbb{P}_{\theta_0}$.⁴

In general, solving $\mu(\theta) = u$ corresponds to finding solutions to systems of multivariate
polynomials, which is NP-hard [28]. However, $\mu(\theta)$ often has additional structure which we can
exploit. For instance, for an HMM, the sliced third-order moments $\mu_{123\eta}(\theta)$ can be written
as a product of parameter matrices in $\theta$, and each matrix can be recovered by decomposing the
product [1].

For parsing models, the challenge is that the topology is random, so the moments are not a single
product, but a mixture over products. To deal with this complication, we propose a new technique,
which we call unmixing: we "unmix" the products from the mixtures, essentially reducing the
problem to one with a fixed topology.

³ Note that these subclasses occupy measure zero subsets of the PCFG parameter space, which is
expected given the non-identifiability of the general PCFG.

⁴ We will develop our algorithms assuming true moments ($u = \mu(\theta_0)$). The empirical moments
converge to the true moments at $O_p(n^{-1/2})$, and matrix perturbation arguments (e.g., [1]) can be
used to derive sample complexity arguments for the parameter error.



We will first present the general idea of unmixing (Section 5.1) and then apply it to the PCFG-IE
model (Section 5.2) and the DEP-IES model (Section 5.3).



5.1 General case 

We assume the observation function $\phi(x)$ consists of a collection of observation matrices
$\{\phi_o(x)\}_{o \in \mathcal{O}}$ (e.g., for $o = (i, j)$, $\phi_o(x) = x_i \otimes x_j$). Given an
observation matrix $\phi_o(x)$ and a topology $t \in \mathrm{Topologies}$, consider the mapping that
computes the observed moment conditioned on that topology:
$\Psi_{o,t}(\theta) = \mathbb{E}_\theta[\phi_o(x) \mid \mathrm{Topology} = t]$. As we range $o$ over
$\mathcal{O}$ and $t$ over $\mathrm{Topologies}$, we will encounter a finite number of such mappings. We
call these mappings compound parameters, denoted $\{\Psi_p\}_{p \in \mathcal{P}}$.

Now write the observed moments as a weighted sum:

\[
\mu_o(\theta) = \sum_{p \in \mathcal{P}}
  \underbrace{\mathbb{P}(\Psi_{o,\mathrm{Topology}} = \Psi_p)}_{=: \, M_{op}} \, \Psi_p
  \quad \text{for all } o \in \mathcal{O}, \tag{4}
\]

where we have defined $M_{op}$ to be the probability mass over tree topologies that yield compound
parameter $\Psi_p$. We let $\{M_{op}\}_{o \in \mathcal{O}, p \in \mathcal{P}}$ be the mixing matrix. Note that
(4) defines a system of equations $\mu = M\Psi$, where the variables are the compound parameters and
the constraints are the observed moments. In a sense, we have replaced the original system of
polynomial equations (in $\theta$) with a system of linear equations (in $\Psi$).

The key to the utility of this technique is that the number of compound parameters can be
polynomial in $L$ even when the number of possible topologies is exponential in $L$. Previous
analytic techniques [13] based on Kruskal's theorem [10] cannot be applied here because the possible
topologies are too many and too varied.

Note that the mixing equation $\mu = M\Psi$ holds for each sentence length $L$, but many compound
parameters $p$ appear in the equations of multiple $L$. Therefore, we can combine the equations
across all observed sentence lengths, yielding a more constrained system than if we considered the
equations of each $L$ separately.

The following proposition shows how we can recover $\theta$ by unmixing the observed moments $\mu$:

Proposition 1 (Unmixing). Suppose that there exists an efficient base algorithm to recover $\theta$
from some subset of compound parameters $\{\Psi_p(\theta) : p \in \mathcal{P}_0\}$, and that $e_p^\top$ is in
the row space of $M$ for each $p \in \mathcal{P}_0$. Then we can recover $\theta$ as follows:

Unmix($\mu$):

1. Compute the mixing matrix $M$.

2. Retrieve the compound parameters $\Psi_p(\theta) = (M^\dagger \mu)_p$ for each $p \in \mathcal{P}_0$.

3. Call the base algorithm on $\{\Psi_p(\theta) : p \in \mathcal{P}_0\}$ to obtain $\theta$.

For all our parsing models, $M$ can be computed efficiently using dynamic programming
(Appendix C.2). Note that $M$ is data-independent, so this computation can be done once in advance.
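
Steps 1 and 2 of Unmix can be sketched with a pseudoinverse, as below. This is an illustrative sketch under our own assumptions: the mixing matrix is supplied by hand rather than computed by dynamic programming, the compound parameters are random stand-ins, and the names `in_row_space` and `unmix` are ours. The row-space condition of Proposition 1 is tested using the fact that $M^\dagger M$ is the orthogonal projector onto the row space of $M$.

```python
import numpy as np

def in_row_space(M, p, tol=1e-8):
    """True iff the indicator e_p lies in the row space of M."""
    e_p = np.zeros(M.shape[1])
    e_p[p] = 1.0
    projector = np.linalg.pinv(M) @ M      # orthogonal projector onto row space
    return bool(np.allclose(projector @ e_p, e_p, atol=tol))

def unmix(M, mu, wanted):
    """Recover compound parameters Psi_p, p in `wanted`, from mu = M Psi.

    M has shape (num_observations, num_compound); mu stacks one observed
    moment matrix per observation along its first axis.
    """
    for p in wanted:
        if not in_row_space(M, p):
            raise ValueError(f"compound parameter {p} is not determined by M")
    Psi = np.tensordot(np.linalg.pinv(M), mu, axes=(1, 0))
    return {p: Psi[p] for p in wanted}

# Toy mixing matrix (assumed for this example): three observations, each an
# equal mixture of two out of three compound parameters.
M = 0.5 * np.array([[1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
Psi_true = rng.random((3, 4, 4))                  # stand-in compound parameters
mu = np.tensordot(M, Psi_true, axes=(1, 0))       # observed moments mu = M Psi

recovered = unmix(M, mu, wanted=[1])
```

Because the mixing matrix does not depend on the data, its pseudoinverse could be cached and reused across datasets; with a single observation row (for instance `M[:1]`) the row-space check fails and no compound parameter can be isolated.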

5.2 Application to the PCFG-IE model

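As a concrete example, consider the PCFG-IE model over $L = 3$ words. Write $A = OT$. For
any $\eta \in \mathbb{R}^d$, we can express the observed moments as a sum over the two possible
topologies in Figure 1(a):

\[
\begin{array}{ll}
\mu_{123\eta} := \mathbb{E}[x_1 \otimes x_2 (\eta^\top x_3)] = 0.5 \Psi_{1;\eta} + 0.5 \Psi_{2;\eta}, &
\Psi_{1;\eta} := A \,\mathrm{diag}(T \,\mathrm{diag}(\pi) A^\top \eta) A^\top, \\
\mu_{132\eta} := \mathbb{E}[x_1 \otimes x_3 (\eta^\top x_2)] = 0.5 \Psi_{3;\eta} + 0.5 \Psi_{2;\eta}, &
\Psi_{2;\eta} := A \,\mathrm{diag}(\pi) T^\top \mathrm{diag}(A^\top \eta) A^\top, \\
\mu_{231\eta} := \mathbb{E}[x_2 \otimes x_3 (\eta^\top x_1)] = 0.5 \Psi_{3;\eta} + 0.5 \Psi_{1;\eta}, &
\Psi_{3;\eta} := A \,\mathrm{diag}(A^\top \eta) T \,\mathrm{diag}(\pi) A^\top,
\end{array}
\]

or compactly in matrix form:

\[
\underbrace{\begin{pmatrix} \mu_{123\eta} \\ \mu_{132\eta} \\ \mu_{231\eta} \end{pmatrix}}_{\text{observed moments } \mu_\eta}
=
\underbrace{\begin{pmatrix} 0.5I & 0.5I & 0 \\ 0 & 0.5I & 0.5I \\ 0.5I & 0 & 0.5I \end{pmatrix}}_{\text{mixing matrix } M}
\underbrace{\begin{pmatrix} \Psi_{1;\eta} \\ \Psi_{2;\eta} \\ \Psi_{3;\eta} \end{pmatrix}}_{\text{compound parameters } \Psi_\eta}.
\]

Let us observe $\mu_\eta$ at two different values of $\eta$, say at $\eta = \mathbf{1}$ and $\eta = \tau$ for
some random $\tau$. Since the mixing matrix $M$ is invertible, we can obtain the compound parameters
$\Psi_{2;\mathbf{1}} = (M^{-1} \mu_{\mathbf{1}})_2$ and $\Psi_{2;\tau} = (M^{-1} \mu_\tau)_2$.

Now we will recover $\theta$ from $\Psi_{2;\mathbf{1}}$ and $\Psi_{2;\tau}$ by first extracting $A = OT$ via
an eigenvalue decomposition, and then recovering $\pi$, $T$, and $O$ in turn (all up to the same
unknown permutation) via elementary matrix operations.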

For the first step, we will use the following tool (adapted from Algorithm A of [1]), which allows
us to decompose two related matrix products:

Lemma 1 (Spectral decomposition). Let $M_1, M_2 \in \mathbb{R}^{d \times k}$ have full column rank and
$D$ be a diagonal matrix with distinct diagonal entries. Suppose we observe $X = M_1 M_2^\top$ and
$Y = M_1 D M_2^\top$. Then Decompose($X$, $Y$) recovers $M_1$ up to a permutation and scaling of the
columns.

Decompose($X$, $Y$):

1. Find $U_1, U_2 \in \mathbb{R}^{d \times k}$ such that $\mathrm{range}(U_1) = \mathrm{range}(X)$ and
   $\mathrm{range}(U_2) = \mathrm{range}(X^\top)$.

2. Perform an eigenvalue decomposition of
   $(U_1^\top Y U_2)(U_1^\top X U_2)^{-1} = V S V^{-1}$.

3. Return $(U_1^\top)^\dagger V$.
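
The Decompose routine can be sketched in NumPy as follows. This is a minimal sketch under our own assumptions: the bases $U_1$ and $U_2$ are taken from a truncated SVD of $X$ (one valid choice among many), the rank `k` is passed in explicitly, and the inputs are exact rather than estimated from data. The synthetic matrices `M1`, `M2` and `D` are arbitrary test values.

```python
import numpy as np

def decompose(X, Y, k):
    """Given X = M1 M2^T and Y = M1 D M2^T, return M1 up to column
    permutation and scaling (sketch of the Decompose steps above)."""
    U, _, Vt = np.linalg.svd(X)
    U1 = U[:, :k]                 # orthonormal basis of range(X)
    U2 = Vt[:k].T                 # orthonormal basis of range(X^T)
    B = (U1.T @ Y @ U2) @ np.linalg.inv(U1.T @ X @ U2)
    _, V = np.linalg.eig(B)       # B = (U1^T M1) D (U1^T M1)^{-1}
    # Eigenvalues are real here because B is similar to the real diagonal D.
    return np.linalg.pinv(U1.T) @ np.real(V)

rng = np.random.default_rng(0)
d, k = 5, 3
M1 = rng.random((d, k)) + 0.1
M1 /= M1.sum(axis=0)              # columns sum to one, like A = OT
M2 = rng.random((d, k)) + 0.1
D = np.diag([0.3, 0.7, 1.1])      # distinct diagonal entries

X = M1 @ M2.T
Y = M1 @ D @ M2.T

M1_hat = decompose(X, Y, k)
M1_hat /= M1_hat.sum(axis=0)      # fix the scaling using the column sums
```

After rescaling, the columns of `M1_hat` agree with those of `M1` up to an unknown reordering, which is the permutation ambiguity stated in Lemma 1.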



First, run Decompose($X = \Psi_{2;\mathbf{1}}^\top$, $Y = \Psi_{2;\tau}^\top$) (Lemma 1), which
corresponds to $M_1 = A$ and $M_2 = A \,\mathrm{diag}(\pi) T^\top$. This produces $A \Pi S$ for some
permutation matrix $\Pi$ and diagonal scaling $S$. Since we know that the columns of $A$ sum to one,
we can identify $A \Pi$.

To recover the initial distribution $\pi$ (up to permutation), take $\Psi_{2;\mathbf{1}} \mathbf{1} = A\pi$
and left-multiply by $(A\Pi)^\dagger$ to get $\Pi^{-1} \pi$. For $T$, put the entries of $\pi$ in a
diagonal matrix: $\Pi^{-1} \mathrm{diag}(\pi) \Pi$. Take
$\Psi_{2;\mathbf{1}}^\top = A T \,\mathrm{diag}(\pi) A^\top$ and multiply by $(A\Pi)^\dagger$ on the left and
$((A\Pi)^\top)^\dagger (\Pi^{-1} \mathrm{diag}(\pi) \Pi)^{-1}$ on the right, which yields $\Pi^{-1} T \Pi$.
(Note that $\Pi$ is orthogonal, so $\Pi^{-1} = \Pi^\top$.) Finally, multiply $A\Pi = OT\Pi$ and
$(\Pi^{-1} T \Pi)^{-1}$, which yields $O\Pi$.
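
The matrix operations in this recovery step can be checked on synthetic parameters. In the sketch below (our own construction), the permuted matrix `A_perm` standing for $A\Pi$ is formed directly from a known permutation instead of being produced by Decompose, and the compound parameter `Psi_21` standing for $\Psi_{2;\mathbf{1}}$ is computed from its formula instead of being unmixed from data. All concrete values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3

def column_stochastic(W):
    """Normalize so that each column (or a single vector) sums to one."""
    return W / W.sum(axis=0)

# Synthetic ground-truth PCFG-IE parameters (assumed values for the demo).
O = column_stochastic(rng.random((d, k)) + 0.05)
T = 0.7 * np.eye(k) + 0.3 * column_stochastic(rng.random((k, k)) + 0.05)
pi = column_stochastic(rng.random(k) + 0.2)
A = O @ T

Pi = np.eye(k)[:, [2, 0, 1]]             # an arbitrary permutation matrix
A_perm = A @ Pi                          # stands for the normalized Decompose output

Psi_21 = A @ np.diag(pi) @ T.T @ A.T     # Psi_{2;1} = A diag(pi) T^T A^T

A_pinv = np.linalg.pinv(A_perm)
# Psi_{2;1} 1 = A pi, so this gives Pi^{-1} pi.
pi_perm = A_pinv @ (Psi_21 @ np.ones(d))
# Psi_{2;1}^T = A T diag(pi) A^T; strip A on both sides, then the diagonal.
T_perm = A_pinv @ Psi_21.T @ np.linalg.pinv(A_perm.T) @ np.diag(1.0 / pi_perm)
# A Pi = O T Pi, so right-multiplying by (Pi^{-1} T Pi)^{-1} leaves O Pi.
O_perm = A_perm @ np.linalg.inv(T_perm)
```

All three recovered quantities carry the same permutation `Pi`, which is consistent with the remark that the parameters are recovered up to a common relabeling of the hidden states.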

The above algorithm identifies the PCFG-IE from only length 3 sentences. To exploit sentences
of different lengths, we can compute a mixing matrix $M$ which includes constraints from sentences
of length $1 \le L \le L_{\max}$ up to some upper bound $L_{\max}$. For example, $L_{\max} = 10$ results in
a $990 \times 2376$ mixing matrix. We can retrieve the same compound parameters
($\Psi_{2;\mathbf{1}}$ and $\Psi_{2;\tau}$) from the pseudoinverse of $M$ and proceed as before.

5.3 Application to the DEP-IES model 

We now turn to the DEP-IES model over $L = 3$ words. Our goal is to recover the parameters
$\theta = (\pi, A)$. Let $D = \mathrm{diag}(\pi) = \mathrm{diag}(A\pi)$, where the second equality is due to
stationarity of $\pi$.

\[
\begin{array}{rcl}
\mu_1 := \mathbb{E}[x_1] & = & \pi, \\
\mu_{12} := \mathbb{E}[x_1 \otimes x_2] & = & 7^{-1}(DA^\top + DA^\top + DA^\top A^\top + AD + ADA^\top + AD + DA^\top), \\
\mu_{13} := \mathbb{E}[x_1 \otimes x_3] & = & 7^{-1}(DA^\top + DA^\top A^\top + DA^\top + ADA^\top + AD + AAD + AD), \\
\tilde{\mu}_{12} := \tilde{\mathbb{E}}[x_1 \otimes x_2] & = & 2^{-1}(DA^\top + AD),
\end{array}
\]

where $\tilde{\mathbb{E}}[\cdot]$ is taken with respect to length 2 sentences. Having recovered $\pi$ from
$\mu_1$, it remains to recover $A$. By selectively combining the moments above, we can compute
$AA + A = [7(\mu_{13} - \mu_{12}) + 2\tilde{\mu}_{12}] \,\mathrm{diag}(\mu_1)^{-1}$. Indeed,
$7(\mu_{13} - \mu_{12}) = AAD - DA^\top$ and $2\tilde{\mu}_{12} = DA^\top + AD$, so their sum is $(AA + A)D$.
Assuming $A$ is in generic position, it is diagonalizable: $A = Q \Lambda Q^{-1}$ for some
diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, possibly with complex entries.
Therefore, we can recover $\Lambda^2 + \Lambda = Q^{-1}(AA + A)Q$. Since $\Lambda$ is diagonal, we simply
have $d$ independent quadratic equations in $\lambda_i$, which can be solved in closed form. After
obtaining $\Lambda$, we retrieve $A = Q \Lambda Q^{-1}$.

6 Discussion

In this work, we have shed some light on the identifiability of standard generative parsing models
using our numerical identifiability checker. Given the ease with which this checker can be applied,
we believe it should be a useful tool for analyzing more sophisticated models [6], as well as
developing new ones which are expressive yet identifiable.

There is still a large gap between showing identifiability and developing explicit algorithms. We
have made some progress on closing it with our unmixing technique, which can deal with models
where the tree topology varies non-trivially.

References

[1] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and
hidden Markov models. In COLT, 2012.

[2] F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In
ACL, 1992.

[3] G. Carroll and E. Charniak. Two experiments on learning probabilistic dependency grammars
from corpora. In Workshop Notes for Statistically-Based NLP Techniques, AAAI, pages 1-13, 1992.

[4] M. A. Paskin. Grammatical bigrams. In NIPS, 2002.

[5] D. Klein and C. D. Manning. Conditional structure versus conditional estimation in NLP
models. In EMNLP, 2002.

[6] D. Klein and C. D. Manning. Corpus-based induction of syntactic structure: Models of
dependency and constituency. In ACL, 2004.

[7] P. Liang and D. Klein. Analyzing the errors of unsupervised learning. In HLT/ACL, 2008.

[8] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and
consistency. Mathematical Biosciences, 137:51-73, 1996.

[9] A. Anandkumar, K. Chaudhuri, D. Hsu, S. M. Kakade, L. Song, and T. Zhang. Spectral
methods for learning multivariate latent tree structure. In NIPS, 2011.

[10] J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with
application to arithmetic complexity and statistics. Linear Algebra and its Applications,
18:95-138, 1977.

[11] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure
models with many observed variables. Annals of Statistics, 37:3099-3132, 2009.

[12] E. S. Allman, S. Petrović, J. A. Rhodes, and S. Sullivant. Identifiability of 2-tree mixtures for
group-based models. Transactions on Computational Biology and Bioinformatics, 8:710-722, 2011.

[13] J. A. Rhodes and S. Sullivant. Identifiability of large phylogenetic mixture models. Bulletin
of Mathematical Biology, 74(1):212-231, 2012.

[14] T. J. Rothenberg. Identification in parametric models. Econometrica, 39:577-591, 1971.

[15] L. A. Goodman. Exploratory latent structure analysis using both identifiable and
unidentifiable models. Biometrika, 61(2):215-231, 1974.

[16] D. Bamber and J. P. H. van Santen. How many parameters can a model have and still be
testable? Journal of Mathematical Psychology, 29:443-473, 1985.

[17] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified exponential families: graphical
models and model selection. Annals of Statistics, 29:505-529, 2001.

[18] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the
inside-outside algorithm. Computer Speech and Language, 4:35-56, 1990.

[19] M. Johnson, T. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain
Monte Carlo. In HLT/NAACL, 2007.

[20] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals
of Applied Probability, 16(2):583-614, 2006.

[21] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models.
In COLT, 2009.






[22] S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. In AISTATS,
2010.

[23] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In
ICML, 2011.

[24] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of
latent-variable PCFGs. In ACL, 2012.

[25] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning for non-deterministic
dependency parsing. In EACL, 2012.

[26] P. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. Ungar. Spectral dependency parsing
with latent variables. In EMNLP-CoNLL, 2012.

[27] J. Eisner. Three new probabilistic models for dependency parsing: An exploration. In
COLING, 1996.

[28] S. Sahni. Computationally related problems. SIAM Journal on Computing, 3:262-279, 1974.

[29] J. Eisner. Bilexical grammars and their cubic-time parsing algorithms. In Advances in
Probabilistic and Other Parsing Technologies, pages 29-62, 2000.

A Proof of Theorem [T] 

Theorem 1 (restated). Assume $\Theta$ is a non-empty open connected subset of $[0,1]^n$ and
$\mu : \mathbb{R}^n \to \mathbb{R}^m$ is a polynomial map. With probability 1, the following holds.

• CheckIdentifiability returns "no" $\Rightarrow$ for almost all $\theta_0 \in \Theta$ and any open neighborhood
$N(\theta_0)$ around $\theta_0$, $|S_\Theta(\theta_0) \cap N(\theta_0)|$ is infinite (not locally identifiable).

• CheckIdentifiability returns "yes" $\Rightarrow$ (i) for almost all $\theta_0 \in \Theta$, there exists an open
neighborhood $N(\theta_0)$ around $\theta_0$ such that $|S_\Theta(\theta_0) \cap N(\theta_0)| = 1$ (locally identifiable); and (ii)
there exists a set $\mathcal{E} \subset \Theta$ with measure zero such that $|S_{\Theta \setminus \mathcal{E}}(\theta_0)|$ is finite for every
$\theta_0 \in \Theta \setminus \mathcal{E}$ (identifiability of $\Theta \setminus \mathcal{E}$).

The proof of Theorem 1 crucially relies on the following lemma from [16], which holds even
in the case that $\mu$ is merely an analytic function (see Lemma 9 of [17] for a simpler proof in
the case $\mu$ is a polynomial map); it states that the Jacobian achieves its maximal rank almost
everywhere in $\Theta$. To state this precisely, first define
$r_{\max} \stackrel{\text{def}}{=} \max\{\operatorname{rank}(J(\theta)) : \theta \in \Theta\}$ and
$\Theta_{\max} \stackrel{\text{def}}{=} \{\theta \in \Theta : \operatorname{rank}(J(\theta)) = r_{\max}\}$.

Lemma 2. The set $\Theta \setminus \Theta_{\max}$ has Lebesgue measure zero. That is, $\Theta_{\max}$ is almost all of $\Theta$.

Proof of Theorem 1. By Lemma 2, CheckIdentifiability chooses a point $\theta \in \Theta_{\max}$ with
probability 1. We henceforth condition on this event, so $\operatorname{rank}(J(\theta)) = r_{\max}$.

Case 1: $\operatorname{rank}(J(\theta)) < n$ (i.e., "no" is returned). In this case, we have $r_{\max} < n$. We now employ
an argument from the proof of Proposition 20 of [16]. Fix any $\theta_0 \in \Theta_{\max}$. Since $\Theta$ is open, Weyl's
theorem implies that there is an open neighborhood $U$ around $\theta_0$ in $\Theta$ on which $\operatorname{rank}(J(\theta)) = r_{\max}$
for all $\theta \in U$ (i.e., $\operatorname{rank}(J(\cdot))$ is constant on $U$). Therefore, by the constant rank theorem, there is
an open neighborhood $N(\theta_0)$ around $\theta_0$ in $U$ such that $\mu^{-1}(\mu(\theta_0)) \cap N(\theta_0)$ is homeomorphic with
an open set in $\mathbb{R}^{n - r_{\max}}$. Therefore $S_\Theta(\theta_0) \cap N(\theta_0)$ is uncountably infinite.

Case 2: $\operatorname{rank}(J(\theta)) = n$ (i.e., "yes" is returned). In this case, we have $r_{\max} = n$. Therefore
for every $\theta_0 \in \Theta_{\max}$, the Jacobian $J(\theta_0)$ has full column rank, and thus by the inverse function
theorem, $\mu$ is injective on a neighborhood of $\theta_0$. This in turn implies that for all $\theta_0 \in \Theta_{\max}$, there
exists an open neighborhood $N(\theta_0)$ around $\theta_0$ such that $S_\Theta(\theta_0) \cap N(\theta_0) = \{\theta_0\}$. This proves (i).

To show (ii), define $\mathcal{E} \stackrel{\text{def}}{=} \Theta \setminus \Theta_{\max}$, and now claim that for every $\theta_0 \in \Theta_{\max}$, the equivalence
class $S_{\Theta_{\max}}(\theta_0)$ is finite. Observe that by (i), the set $S_{\Theta_{\max}}(\theta_0)$ contains only geometrically isolated
solutions to the system of polynomial equations given by $\mu(\theta) = \mu(\theta_0)$. Therefore the claim
follows immediately from Bezout's Theorem, which implies that the number of geometrically isolated
solutions is finite. □
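
The rank test that Theorem 1 reasons about can be sketched numerically. The snippet below is only an illustration under stated assumptions: the function names, the finite-difference Jacobian, the uniform sampling from $(0,1)^n$, the rank tolerance, and the two toy polynomial maps are all hypothetical choices made here, not the parsing moments or the implementation used in the paper.

```python
import numpy as np

def numerical_jacobian(mu, theta, eps=1e-6):
    """Central-difference Jacobian of mu at theta (m rows, n columns)."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((mu(theta + step) - mu(theta - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def check_identifiability(mu, n, seed=0, tol=1e-6):
    """Return True ("yes") iff the Jacobian at one random point has rank n."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=n)      # assumed sampling: uniform on (0, 1)^n
    J = numerical_jacobian(mu, theta)
    return bool(np.linalg.matrix_rank(J, tol=tol) == n)

# Toy polynomial maps (hypothetical, for illustration only).
mu_yes = lambda th: np.array([th[0] + th[1], th[0] * th[1]])         # rank 2 almost everywhere
mu_no = lambda th: np.array([th[0] * th[1], (th[0] * th[1]) ** 2])   # rank 1 everywhere

answer_yes = check_identifiability(mu_yes, n=2)
answer_no = check_identifiability(mu_no, n=2)
```

The first toy map is locally identifiable but its equivalence classes have two elements (swap the two coordinates), which matches the finite, not necessarily singleton, guarantee in part (ii).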



Remark. All the models considered in this paper have moments $\mu$ which correspond to a polynomial
map. However, for some models (e.g., exponential families), $\mu$ will not be a polynomial map,
but rather, a general analytic function. In this case, Theorem 1 holds with one modification to (ii).
If CheckIdentifiability returns "yes", then we have the following weaker guarantee in place of
(ii): $S_{\Theta_{\max}}(\theta_0)$ is countable (but not necessarily finite) for all $\theta_0 \in \Theta_{\max}$. The above proof does not
require the fact that $\mu$ is a polynomial map except in the invocation of Bezout's Theorem. In place
of Bezout's Theorem, we use the following argument. If $S_{\Theta_{\max}}(\theta_0)$ is uncountable, then it contains
a limit point $\theta^* \in S_{\Theta_{\max}}(\theta_0)$; thus for any small enough neighborhood $N(\theta^*)$ of $\theta^*$, there is some
$\theta \in S_{\Theta_{\max}}(\theta_0) \cap N(\theta^*)$. This contradicts (i) as applied to $\theta^*$, and thus we conclude that $S_{\Theta_{\max}}(\theta_0)$
is countable.



B Additional results from the identifiability checker 

PCFG models with d < k. The PCFG models that we've considered so far assume that the
number of words d is at least the number of hidden states k, which is a realistic assumption
for natural language. However, there are applications, e.g., computational biology, where the
vocabulary size d is relatively small. In this regime, identifiability becomes trickier because the
data reveals less about the hidden states, which brings us closer to the boundary between
identifiability and non-identifiability. In this section, we consider the d < k regime.

The following table gives additional identifiability results from CheckIdentifiability for
values of d, k, and L where d < k (recall that the results reported in Section 4.4 only considered
values where $d \ge k$). In each cell, we show the (k, d, L) values for which CheckIdentifiability
returned "yes"; the values checked were $k \in \{3, 4, \ldots, 8\}$, $d \in \{2, \ldots, k-1\}$, $L \in \{3, 4, \ldots, 9\}$.
Rows are models; within each row, the cells are listed left to right, one per line.

PCFG
    None

PCFG-I
    None
    (…,≥6), (…,≥…), (…,≥5), (…,≥6), (…,≥4), (…,≥7), (…,≥5), (…,≥4), (…,≥…), (…,≥6), (…,≥5), (…,≥4)
    None
    (5,4,≥4), (6,5,≥4), (7,5,≥4), (7,6,≥4)
    (3,2,≥5), (4,2,≥6), (4,3,≥4), (5,2,≥7), (5,≥3,≥4), (6,2,≥8), (6,3,≥5), (6,≥4,≥4), (7,2,≥9), (7,3,≥5), (7,≥4,≥4)

PCFG-IE
    None
    (…,≥6), (…,≥…), (…,≥5), (…,≥6), (…,≥5), (…,≥7), (…,≥5), (…,≥4), (…,≥…), (…,≥6), (…,≥5), (…,≥4)
    (5,4,≥4), (6,5,≥4), (7,5,≥5), (7,6,≥4)
    (4,3,≥4), (5,4,≥4), (6,≥4,≥4), (7,≥5,≥4)
    (…,2,≥5), (…,2,≥6), (…,3,≥4), (…,2,≥7), (…,3,≥5), (5,4,≥4), (6,2,≥8), (6,3,≥5), (6,≥4,≥4), (7,2,≥9), (7,3,≥5), (7,≥4,≥4)
    (3,2,≥5), (4,2,≥6), (4,3,≥4), (5,2,≥7), (5,≥3,≥4), (6,2,≥8), (6,3,≥5), (6,≥4,≥4), (7,2,≥9), (7,3,≥5), (7,≥4,≥4)



Fixed topology models. We now present some results for latent class models (LCMs) and
hidden Markov models (HMMs). While identifiability results for these models are more developed than
those for parsing models, we show that the identifiability checker can refine the results even for these
classic models.

The parameters of an HMM are $\theta = (\pi, T, O)$, where $\pi \in \mathbb{R}^k$ specifies the initial state
distribution, $T \in \mathbb{R}^{k \times k}$ specifies the state transition probabilities, and $O \in \mathbb{R}^{d \times k}$ specifies the emission
distributions. The probability of a sentence x is:

$$P_\theta(x) = \mathbf{1}^\top T \operatorname{diag}(O^\top x_L) \cdots T \operatorname{diag}(O^\top x_2)\, T \operatorname{diag}(O^\top x_1)\, \pi. \tag{5}$$

The parameters of an LCM are $\theta = (\pi, O)$, the same as those of an HMM except with $T = I$. The
probability of a sentence x is also given by (5) (with $T = I$).
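
As a sanity check on (5), the product can be evaluated right to left and summed over all sentences of a fixed length. This is a minimal sketch with assumptions made here: words are represented by integer ids (so $O^\top x_\ell$ is a row of $O$), the columns of $T$ and $O$ are probability distributions, and the helper names and parameter sizes are hypothetical.

```python
import itertools
import numpy as np

def random_stochastic(rows, cols, rng):
    """Random matrix whose columns are probability distributions."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

def hmm_prob(x, pi, T, O):
    """Evaluate 1^T T diag(O^T x_L) ... T diag(O^T x_1) pi for integer word ids x."""
    v = pi
    for word in x:                   # x_1 is applied first, x_L last
        v = T @ (O[word, :] * v)     # diag(O^T x_l) v, followed by one transition
    return v.sum()                   # left-multiplication by the all-ones vector

rng = np.random.default_rng(0)
k, d, L = 2, 3, 3
pi = random_stochastic(k, 1, rng)[:, 0]
T = random_stochastic(k, k, rng)
O = random_stochastic(d, k, rng)

sentences = list(itertools.product(range(d), repeat=L))
total_hmm = sum(hmm_prob(x, pi, T, O) for x in sentences)
total_lcm = sum(hmm_prob(x, pi, np.eye(k), O) for x in sentences)   # LCM: T = I
```

Both totals should equal one up to floating-point error, since summing $\operatorname{diag}(O^\top x_\ell)$ over all words gives the identity.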

The following table summarizes some identifiability results obtained by CheckIdentifiability
(for $d \ge k$); these results have all been proven analytically in previous work (e.g., [8], [10], [11], [20], [21])
except for the identifiability of HMMs from $\phi_{**}$.

Model    $\phi_{12}$    $\phi_{123\eta}$, $\phi_{123}$, $\phi_{***\eta}$, $\phi_{***}$
LCM      No             Yes iff $L \ge 3$
HMM      No             Yes iff $L \ge 3$

It is known that LCMs are not identifiable from $\phi_{**}$ for any value of L [8]. However, LCMs
constitute a subfamily of HMMs arising from a measure zero subset of the HMM parameter space.
Therefore the identifiability of HMMs from $\phi_{**}$ (for L > 3) does not contradict this result. The
result does not appear to be covered by application of Kruskal's theorem in previous work [11], so
we prove the result rigorously below.

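It can be checked using (5) that

$$\mathbb{E}_\theta[\phi_{12}(x)] = O \operatorname{diag}(\pi)\, T^\top O^\top,$$
$$\mathbb{E}_\theta[\phi_{34}(x)] = O \operatorname{diag}(T\pi)\, T^\top O^\top.$$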

Let $M_1 \stackrel{\text{def}}{=} O$, $M_2 \stackrel{\text{def}}{=} O T \operatorname{diag}(\pi)$, and $D \stackrel{\text{def}}{=} \operatorname{diag}(T\pi) \operatorname{diag}(\pi)^{-1}$. Provided that

1. $\pi > 0$,

2. $O$ has full column rank,

3. $T$ is invertible,

4. the ratios of probabilities $(T\pi)_i / \pi_i$, ranging over $i \in [k]$, are distinct

(all of which are true for all but a measure zero set of parameters in $\Theta$), the matrices $M_1$ and $M_2$
have full column rank and the diagonal matrix $D$ has distinct diagonal entries. Therefore Lemma 1
can be applied with $X = \mathbb{E}_\theta[\phi_{12}(x)] = M_1 M_2^\top$ and $Y = \mathbb{E}_\theta[\phi_{34}(x)] = M_1 D M_2^\top$ to recover $M_1 = O$.
Given $O$, it is easy to see that $\pi$ and $T$ can also be recovered.

Note that the fourth condition above, that $T\pi$ be entry-wise distinct from $\pi$, is violated when
an LCM distribution is cast as an HMM distribution (by setting $T = I$ so $T\pi = \pi$). However, the
set of HMM parameters satisfying this equation is a measure zero set.
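
The recovery step can be illustrated numerically. The sketch below assumes that the lemma invoked above amounts to an eigendecomposition of $Y X^{+} = M_1 D M_1^{+}$; that reading, the function name, the pseudo-inverse cutoff, and the specific parameter values are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def recover_emissions(X, Y, k):
    """Recover diag(D) and the columns of M1 (up to order) from X = M1 M2^T, Y = M1 D M2^T."""
    W = Y @ np.linalg.pinv(X, rcond=1e-10)        # equals M1 D M1^+ under the rank conditions
    eigvals, eigvecs = np.linalg.eig(W)
    top = np.argsort(-np.abs(eigvals))[:k]        # the k nonzero eigenvalues are diag(D)
    vals = eigvals[top].real
    vecs = eigvecs[:, top].real
    vecs = vecs / vecs.sum(axis=0, keepdims=True) # columns of O are distributions
    order = np.argsort(vals)                      # fix an order: increasing ratio
    return vals[order], vecs[:, order]

# Hypothetical parameters with k = 3 states and d = 4 words (columns sum to one).
pi = np.array([0.5, 0.3, 0.2])
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
O = np.array([[0.7, 0.1, 0.1],
              [0.1, 0.6, 0.1],
              [0.1, 0.2, 0.2],
              [0.1, 0.1, 0.6]])

X = O @ np.diag(pi) @ T.T @ O.T        # M1 M2^T
Y = O @ np.diag(T @ pi) @ T.T @ O.T    # M1 D M2^T
ratios, O_hat = recover_emissions(X, Y, k=3)
```

For these values the ratios $(T\pi)_i/\pi_i$ are 0.94, 1.0 and 1.15, already in increasing order, so `O_hat` can be compared with `O` column by column. Setting `T` to the identity makes all ratios equal to one, which is exactly the LCM failure case described above.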



Discussion. CheckIdentifiability tests for local identifiability. If it finds that a model family
is not locally identifiable, then it is not globally identifiable. However, the converse is not
necessarily true: if it finds that a model family is locally identifiable, it is not necessarily globally
identifiable. Theorem 1 provides the somewhat weaker guarantee that a restricted model family is
globally identifiable, where the equivalence classes $S_{\Theta \setminus \mathcal{E}}(\theta_0)$ are only taken with respect to a subset
$\Theta \setminus \mathcal{E} \subseteq \Theta$ of the parameter space. However, there is a gap between this property (which is with
respect to $\Theta \setminus \mathcal{E}$) and true global identifiability (which is with respect to $\Theta$).

On the other hand, having explicit estimators guarantees us proper global identifiability with
respect to the original model family $\Theta$. In fact, the exceptional set $\mathcal{E}$ can typically be characterized
explicitly. For instance, in the case of PCFG-IE, the set $\Theta \setminus \mathcal{E}$ contains those $\theta = (\pi, T, O)$ that
satisfy full rank conditions:

$$\Theta \setminus \mathcal{E} = \{(\pi, T, O) : \pi \succ 0,\ T \text{ is invertible},\ O \text{ has full column rank}\}. \tag{6}$$

Additionally, the explicit estimators also provide an explicit characterization of the elements
in the equivalence class $S_\Theta(\theta_0)$ for each $\theta_0 \in \Theta \setminus \mathcal{E}$: the set $S_\Theta(\theta_0)$ contains exactly $k!$ elements
corresponding to permutations of the hidden states. Specifically,

$$S_\Theta((\pi, T, O)) = \{(\Pi^{-1}\pi,\ \Pi^{-1} T \Pi,\ O\Pi) : \Pi \text{ is a permutation matrix}\}. \tag{7}$$

Note that this is sharper than Theorem 1, which only says that the equivalence classes have to be
finite.
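
To see why relabeling the hidden states cannot change the observed distribution, the invariance can be worked out on the HMM expression (5), which uses the same parameter triple $(\pi, T, O)$. This is only an illustrative check on that fixed-topology formula, not a derivation for PCFG-IE; it uses the facts that a permutation matrix satisfies $\Pi^{-1} = \Pi^\top$ and $\mathbf{1}^\top \Pi^{-1} = \mathbf{1}^\top$.

```latex
\begin{align*}
\operatorname{diag}\big((O\Pi)^\top x_\ell\big)
  &= \operatorname{diag}\big(\Pi^\top O^\top x_\ell\big)
   = \Pi^{-1} \operatorname{diag}(O^\top x_\ell)\, \Pi, \\
(\Pi^{-1} T \Pi)\, \operatorname{diag}\big((O\Pi)^\top x_\ell\big)
  &= \Pi^{-1}\, T \operatorname{diag}(O^\top x_\ell)\, \Pi, \\
\mathbf{1}^\top \Big[\prod_{\ell = L}^{1} (\Pi^{-1} T \Pi)\,
    \operatorname{diag}\big((O\Pi)^\top x_\ell\big)\Big] \Pi^{-1}\pi
  &= \mathbf{1}^\top \Pi^{-1} \Big[\prod_{\ell = L}^{1} T \operatorname{diag}(O^\top x_\ell)\Big] \pi
   = P_\theta(x).
\end{align*}
```

The inner factors $\Pi \Pi^{-1}$ cancel between consecutive terms of the product, so every one of the $k!$ permutations yields the same distribution over sentences.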
C Dynamic programs



For a sentence of length L, the number of parse trees is exponential in L. Therefore, dynamic
programming is often employed to efficiently compute expectations over the parse trees, the core
computation in the E-step of the EM algorithm. In the case of PCFG, this dynamic program is
referred to as the CKY algorithm, which runs in $O(L^3 k^3)$ time, where k is the number of hidden
states. For simple dependency models, an $O(L^3)$ dynamic program was developed by [29]. At a
high level, the states of the dynamic program in both cases are the spans [i : j] of the sentence (and
for the PCFG, these states include the hidden states $z_{[i:j]}$ of the nodes).

In this paper, we need to compute (i) the Jacobian matrix for checking identifiability
(Section 4.2) and (ii) the mixing matrix for recovering compound parameters (Section 5.1). Both
computations can be performed efficiently with a modified version of the classic dynamic programs,
which we will describe in this section.

C.1 Computing the Jacobian matrix

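Recall that the j-th row of the Jacobian matrix J is (the transpose of) the gradient of
$h_j(\theta) = \mu_j(\theta) - \mu_j(\theta_0)$. Specifically, entry $J_{ji}$ is the derivative of the j-th moment with respect to the i-th
parameter:

$$J_{ji} = \frac{\partial h_j(\theta)}{\partial \theta_i} = \frac{\partial\, \mathbb{E}_\theta[\phi_j(x)]}{\partial \theta_i}. \tag{9}$$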



We can encode the sum over the exponential set of possible sentences x and parse trees z using
a directed acyclic hypergraph so that each hyperpath through the hypergraph corresponds to a
(x, z) pair. Specifically, a hypergraph consists of the following:

• a set of nodes $V$ with a designated start node $\text{Start} \in V$ and an end node $\text{End} \in V$, and

• a set of hyperedges $E$ where each hyperedge $e \in E$ has a source node $e.a \in V$ and a pair
of target nodes $(e.b, e.c) \in V \times V$ (we say that e connects e.a to e.b and e.c) and an index
$e.i \in [n]$ corresponding to a component of the parameter vector $\theta \in \mathbb{R}^n$.

Define a hyperpath P to be a subset of the edges $E$ such that:

• $(\text{Start}, a, b) \in P$ for some $a, b \in V$;

• if $(a, b, c) \in P$ and $b \neq \text{End}$, then $(b, d, e) \in P$ for some $d, e \in V$; and

• if $(a, b, c) \in P$ and $c \neq \text{End}$, then $(c, d, e) \in P$ for some $d, e \in V$.

Each hyperpath P, encoding (x, z), is associated with a probability equal to the product of all
of the parameters on that hyperpath:

$$P_\theta(x, z) = p_\theta(P) = \prod_{e \in P} \theta_{e.i}. \tag{11}$$

In this way, the hypergraph compactly defines a distribution over exponentially many hyperpaths.

Now, we assume that each moment $\phi_j(x)$ corresponds to a function $f_j : E \to \mathbb{R}$ mapping each
hyperedge e to a real number so that the moment is equal to the product over function values:

$$\phi_j(x) = \prod_{e \in P} f_j(e), \tag{12}$$

where P is any hyperpath that encodes the sentence x and some parse tree z (we assume that the
product is the same no matter what z is).

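Now, let us write out the Jacobian matrix entries in terms of hyperpaths:

$$J_{ji} = \sum_{P} \sum_{e \in P} \frac{\partial \theta_{e.i}}{\partial \theta_i}\, f_j(e) \prod_{e' \in P,\, e' \neq e} \theta_{e'.i}\, f_j(e'). \tag{13}$$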

The sum over hyperpaths P can be computed efficiently as follows. For each hypergraph node
a, we compute an inside score $\alpha(a)$, which sums over all possible partial hyperpaths terminating at
the target node, and an outside score $\beta(a)$, which sums over all possible partial hyperpaths from
the source node:

$$\alpha(a) \stackrel{\text{def}}{=} \sum_{e \in E : e.a = a} \theta_{e.i}\, \alpha(e.b)\, \alpha(e.c), \tag{14}$$

$$\beta(a) \stackrel{\text{def}}{=} \sum_{e \in E : e.b = a} \theta_{e.i}\, \alpha(e.c)\, \beta(e.a) + \sum_{e \in E : e.c = a} \theta_{e.i}\, \alpha(e.b)\, \beta(e.a). \tag{15}$$

The Jacobian entry $J_{ji}$ can be computed as follows:

$$J_{ji} = \sum_{e \in E} \beta(e.a)\, \alpha(e.b)\, \alpha(e.c)\, \mathbb{I}[i = e.i]. \tag{16}$$

Example: PCFG. For a PCFG, nodes $V$ have the form $(i, j, s) \in [L] \times [L] \times [k]$, corresponding
to a hidden state s over span [i : j]. For each hidden state s, we have a hyperedge e connecting
e.a = Start to e.b = (0, L, s) and e.c = End; this hyperedge has parameter index e.i corresponding
to $\pi_s$. For each span [i : j] with $j - i > 1$, split point $i < m < j$, and hidden states $s_1, s_2, s_3 \in [k]$,
$E$ contains a hyperedge e connecting $e.a = (i, j, s_1)$ to $e.b = (i, m, s_2)$ and $e.c = (m, j, s_3)$; the
parameter index e.i corresponds to the binary production $B_{(s_2 \otimes_k s_3)\, s_1}$. For each span [i − 1 : i],
hidden state $s \in [k]$ and word $x \in [d]$, we have a hyperedge e connecting $e.a = (i-1, i, s)$ to
e.b = End and e.c = End with parameter index e.i corresponding to the emission $O_{xs}$.

The moments can be encoded as follows. For example, if $\phi_j(x) = \mathbb{I}[x_i = t]$, then we define $f_j(e)$
to be 0 if the source node corresponds to position i ($e.a = (i-1, i, s)$) and the parameter index e.i
does not correspond to $O_{ts}$ for some $s \in [k]$, and 1 otherwise. In this way, $\prod_{e \in P} f_j(e)$ is zero if P
encodes a sentence with $x_i \neq t$.

Higher-order moments simply correspond to hyperedge-wise multiplication of these first-order
moments. For example, if $\phi_{j_1}(x) = \mathbb{I}[x_{i_1} = t_1]$ and $\phi_{j_2}(x) = \mathbb{I}[x_{i_2} = t_2]$, then the second-order
moment $\phi_j(x) = \mathbb{I}[x_{i_1} = t_1, x_{i_2} = t_2]$ corresponds to $f_j(e) = f_{j_1}(e) f_{j_2}(e)$.

[Figure 3: two parse trees, each rooted at $\pi$, labeled Topology(z) = 1 and Topology(z) = 2.]

Figure 3: An example of a backbone structure in blue corresponding to the compound parameter
$O T \operatorname{diag}(T\pi)\, T^\top O^\top$, which appears in two different topologies, for two observation matrices, $\phi_{12}$
and $\phi_{23}$, respectively.
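
The inside and outside recursions of Section C.1 can be sketched on a very small PCFG hypergraph. Everything below is an illustrative assumption rather than the paper's implementation: the layout of the parameter vector, the function names, the convention that the edge weight is $\theta_{e.i} f_j(e)$ (so that the moment function enters the scores), and the boundary values $\alpha(\text{End}) = 1$ and $\beta(\text{Start}) = 1$.

```python
import itertools
from collections import defaultdict
from functools import lru_cache
import numpy as np

START, END = "Start", "End"

def pcfg_hypergraph(L, k, d):
    """Return hyperedges (a, b, c, i), per-edge emission info, and the parameter count.

    Assumed layout of theta: pi (k entries), then B (k*k*k entries), then O (d*k entries).
    """
    edges, emitted = [], []

    def add(a, b, c, i, emission=None):
        edges.append((a, b, c, i))
        emitted.append(emission)               # (position, word) for emission edges

    for s in range(k):                         # Start -> root state over span [0 : L]
        add(START, (0, L, s), END, s)
    for i, j in itertools.combinations(range(L + 1), 2):
        for m in range(i + 1, j):              # no split points when j - i == 1
            for s1, s2, s3 in itertools.product(range(k), repeat=3):
                add((i, j, s1), (i, m, s2), (m, j, s3), k + (s2 * k + s3) * k + s1)
    for i in range(1, L + 1):                  # emissions on spans [i-1 : i]
        for s, x in itertools.product(range(k), range(d)):
            add((i - 1, i, s), END, END, k + k**3 + x * k + s, emission=(i, x))
    return edges, emitted, k + k**3 + d * k

def inside_outside_gradient(edges, theta, f):
    """Return (alpha(Start), gradient of alpha(Start) with respect to theta)."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for e, (a, b, c, _) in enumerate(edges):
        out_edges[a].append(e)
        in_edges[b].append((e, c))             # b is a child of e, its sibling is c
        in_edges[c].append((e, b))             # c is a child of e, its sibling is b

    @lru_cache(maxsize=None)
    def alpha(a):                              # inside score
        if a == END:
            return 1.0
        return sum(theta[edges[e][3]] * f[e] * alpha(edges[e][1]) * alpha(edges[e][2])
                   for e in out_edges[a])

    @lru_cache(maxsize=None)
    def beta(a):                               # outside score
        if a == START:
            return 1.0
        return sum(theta[edges[e][3]] * f[e] * alpha(sib) * beta(edges[e][0])
                   for e, sib in in_edges[a])

    grad = np.zeros(len(theta))
    for e, (a, b, c, i) in enumerate(edges):
        grad[i] += beta(a) * f[e] * alpha(b) * alpha(c)
    return alpha(START), grad

L, k, d, pos, t = 2, 2, 2, 1, 0
edges, emitted, n = pcfg_hypergraph(L, k, d)
# f_j for the moment I[x_pos = t]: zero out emissions of other words at that position.
f = [0.0 if (em is not None and em[0] == pos and em[1] != t) else 1.0 for em in emitted]
theta = np.random.default_rng(0).uniform(0.1, 1.0, size=n)
value, grad = inside_outside_gradient(edges, theta, f)   # one moment and one Jacobian row
```

Here `value` plays the role of the j-th moment as a polynomial in `theta`, and `grad` is the corresponding row of the Jacobian; comparing it with finite differences of `value` is a quick way to validate the recursions.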



C.2 Computing the mixing matrix 

Recall that the mixing matrix M includes a row for each observation matrix $o \in \mathcal{O}$ and a column for
each compound parameter $p \in \mathcal{P}$. Assuming a uniform distribution over topologies, computing each
entry of M reduces to counting the number of topologies t consistent with a particular compound
parameter $\Psi_p$:

$$M_{op} = \mathbb{P}(\Psi_{o,\text{Topology}} = \Psi_p) \tag{17}$$
$$\phantom{M_{op}} = |\text{Topologies}|^{-1} \sum_{t} \mathbb{I}[\Psi_{o,t} = \Psi_p]. \tag{18}$$

First, we will characterize the set of compound parameters graphically in terms of backbone
structures. As an example, consider the PCFG-IE model and the observation matrix $\phi_{12}$ (o = 12)
corresponding to the marginal distribution over the first two words of the sentence. Given a topology
t, consider starting at the root, descending to the lowest common ancestor of $x_1$ and $x_2$, and then
following both paths down to $x_1$ and $x_2$, respectively. We refer to this traversal as the backbone
structure with respect to topology t and observation matrix $\phi_{12}$. See Figure 3 for an example of
the backbone structure, outlined in blue.

Note that the compound parameter $\Psi_{12,t}(\theta) = \mathbb{E}_\theta[\phi_{12}(x) \mid \text{Topology} = t]$ can be written as a
product over the parameter matrices, one for each edge of the backbone structure. For Figure 3,
this would yield

$$\Psi_{12,1}(\theta) = O T \operatorname{diag}(T\pi)\, T^\top O^\top. \tag{19}$$

For general trees, we would have

$$\Psi_{12,t}(\theta) = O T^{n_1} \operatorname{diag}(T^{n_3}\pi)\, (T^\top)^{n_2} O^\top \tag{20}$$

for some positive integers $n_1, n_2, n_3$, corresponding to the number of edges (in t) from the common
node to the preterminal node $z_{01}$, the preterminal node $z_{12}$, and the root $z_{0L}$, respectively.
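
A short numerical sketch of (20) may help fix the conventions. The function name and the parameter values are hypothetical; the only assumption about the model is that the columns of $T$ and $O$, and the vector $\pi$, are probability distributions, in which case the compound parameter is a joint distribution over two words and its entries sum to one.

```python
import numpy as np
from numpy.linalg import matrix_power

def compound_parameter(pi, T, O, n1, n2, n3):
    """Psi = O T^{n1} diag(T^{n3} pi) (T^T)^{n2} O^T, following the general-tree formula."""
    left = O @ matrix_power(T, n1)             # path from the common node to the first leaf
    right = O @ matrix_power(T, n2)            # path from the common node to the second leaf
    common = matrix_power(T, n3) @ pi          # state distribution at the common node
    return left @ np.diag(common) @ right.T

# Hypothetical parameters with k = 3 states and d = 4 words (columns sum to one).
pi = np.array([0.5, 0.3, 0.2])
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
O = np.array([[0.7, 0.1, 0.1],
              [0.1, 0.6, 0.1],
              [0.1, 0.2, 0.2],
              [0.1, 0.1, 0.6]])

psi_fig3 = compound_parameter(pi, T, O, 1, 1, 1)    # the special case O T diag(T pi) T^T O^T
psi_deep = compound_parameter(pi, T, O, 2, 1, 3)    # a deeper backbone
```

Different triples $(n_1, n_2, n_3)$ generally give different matrices, which is why each backbone structure contributes its own column to the mixing matrix.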

Note that the compound parameter does not depend on the structure of t outside the backbone
(that part of the topology is effectively marginalized out), so the compound parameter $\Psi_{12,t}(\theta)$ will
be identical for all topologies sharing that same backbone structure. Therefore, there are only a
polynomial number of compound parameters despite an exponential number of topologies.

We define a dynamic program that recursively computes $M_{op}$ for the PCFG-IE model under
a fixed second-order observation matrix $\phi_{i_0 j_0}$. Specifically, for each span [i : j] define H(i, j) to
be the set of pairs (t, n) where t is a partial backbone structure and n is the number of partial
topologies over span [i : j] which are consistent with t.

In the base case H(i − 1, i), if i is either of the designated leaf positions defined by the observation
matrix ($i_0$ or $j_0$), then we return the single-node backbone structure $\bullet$; otherwise, we return the
null backbone structure $\emptyset$:

$$H(i-1, i) = \begin{cases} \{(\bullet, 1)\} & \text{if } i = i_0 \text{ or } i = j_0, \\ \{(\emptyset, 1)\} & \text{otherwise.} \end{cases} \tag{21}$$

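In the recursive case H(i, j), we consider all split points m, partial backbones $t_1$ and $t_2$ from
H(i, m) and H(m, j), respectively, and create a new tree with $t_1$ and/or $t_2$ as the subtrees if they
are not null:

$$H(i, j) = \bigcup^{+}_{i < m < j}\ \bigcup^{+}_{(t_1, n_1) \in H(i, m)}\ \bigcup^{+}_{(t_2, n_2) \in H(m, j)} \{(\text{Combine}(t_1, t_2),\ n_1 n_2)\}, \tag{22}$$

$$\text{Combine}(t_1, t_2) = \begin{cases} (T : t_1,\ T : t_2) & \text{if } t_1 \neq \emptyset \text{ and } t_2 \neq \emptyset, \\ T : t_1 & \text{if } t_1 \neq \emptyset, \\ T : t_2 & \text{if } t_2 \neq \emptyset, \\ \emptyset & \text{otherwise.} \end{cases} \tag{23}$$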



Here, we use the notation $\bigcup^{+}$ to denote a multi-set union: $\{(t, n_1)\} \cup^{+} \{(t, n_2)\} = \{(t, n_1 + n_2)\}$. In
this notation, the backbone structure in Figure 3 would be represented as $T : (T : O : \bullet,\ T : O : \bullet)$,
which can be easily converted to the compound parameter $O T \operatorname{diag}(T\pi)\, T^\top O^\top$.
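
The second-order recursion for H(i, j) can be sketched as follows. The encoding is an assumption of this example: the null backbone is `None`, the single-node backbone is the string `"leaf"`, `T : t` is the tuple `("T", t)`, and the emission step is left implicit. The function names are hypothetical.

```python
from collections import Counter
from functools import lru_cache
from math import comb

LEAF = "leaf"    # single-node backbone; None plays the role of the null backbone

def combine(t1, t2):
    """Join two partial backbones under a new node, skipping null subtrees."""
    if t1 is not None and t2 is not None:
        return (("T", t1), ("T", t2))
    if t1 is not None:
        return ("T", t1)
    if t2 is not None:
        return ("T", t2)
    return None

def backbone_counts(L, i0, j0):
    """Map each backbone over span [0 : L] to the number of topologies consistent with it."""
    @lru_cache(maxsize=None)
    def H(i, j):
        if j - i == 1:                                  # base case: span [j-1 : j] covers word j
            return ((LEAF if j in (i0, j0) else None, 1),)
        acc = Counter()                                 # multi-set union: counts add up
        for m in range(i + 1, j):
            for t1, n1 in H(i, m):
                for t2, n2 in H(m, j):
                    acc[combine(t1, t2)] += n1 * n2
        return tuple(acc.items())
    return dict(H(0, L))

def catalan(n):
    """Number of binary tree topologies with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)

counts = backbone_counts(3, 1, 2)                       # observation matrix phi_12, L = 3
total = sum(counts.values())
mixing_row = {t: c / total for t, c in counts.items()}  # one row of M under uniform topologies
```

For L = 3 the two topologies produce two different backbones, so each entry of `mixing_row` is 1/2; for longer sentences many topologies collapse onto the same backbone and the counts grow accordingly.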

For third-order observation matrices (e.g., $\phi_{i_0 j_0 k_0 \eta}$), we add an additional case to H(i − 1, i) to
return $(\circ, 1)$ if $i = k_0$; note that $k_0$ is represented by a special node $\circ$ because that observation is
projected using $\eta$. The first case of Combine($t_1$, $t_2$) undergoes one change: if $t_2$ is a chain ending in
$\circ$, then we return $(T : t_2,\ T : t_1)$. The reason for this is best demonstrated by an example: consider
topology 1 in Figure 3 and the two observation matrices $\phi_{132\eta}$ and $\phi_{231\eta}$. Without the reordering,
we would have the backbone structures $(T : (T : \bullet,\ T : \circ),\ T : \bullet)$ and $(T : (T : \circ,\ T : \bullet),\ T : \bullet)$.
However, they have the same compound parameter $O T \operatorname{diag}(T^\top O^\top \eta)\, T \operatorname{diag}(\pi)\, T^\top O^\top$. This is
because the contribution of a subtree ending in $\circ$ is simply a diagonal matrix ($\operatorname{diag}(T^\top O^\top \eta)$ in this
case) which is applied on the hidden state regardless of whether it came from the left or right side.



One might also see why the unmixing technique does not directly apply to the PCFG-I model, where T is
replaced with $T_1$ for left edges and $T_2$ for right edges. In that case, there are many backbone structures (and thus
more compound parameters) due to the different interleavings of left and right edges.